general metric space
Tight Bounds On The Distortion of Randomized and Deterministic Distributed Voting
We study metric distortion in distributed voting, where n voters are partitioned into k groups, each selecting a local representative, and a final winner is chosen from these representatives (or from the entire set of candidates). This setting models systems like U.S. presidential elections, where state-level decisions determine the national outcome. We focus on four cost objectives from Anshelevich et al. [1]: avg-avg, avg-max, max-avg, and max-max. We present improved distortion bounds for both deterministic and randomized mechanisms, offering a near-complete characterization of distortion in this model. For deterministic mechanisms, we reduce the upper bound for avg-max from 11 to 7, establish a tight lower bound of 5 for max-avg (improving on $2+\sqrt{5}$), and tighten the upper bound for max-max from 5 to 3. For randomized mechanisms, we consider two settings: (i) only the second stage is randomized, and (ii) both stages may be randomized. In case (i), we prove tight bounds: $5-2/k$ for avg-avg, 3 for avg-max and max-max, and 5 for max-avg. In case (ii), we show tight bounds of 3 for max-avg and max-max, and nearly tight bounds for avg-avg and avg-max within $[3-2/n, 3-2/(kn)]$ and $[3-2/n, 3]$, respectively, where n denotes the largest group size.
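To make the four objectives concrete, here is a minimal Python sketch. It assumes the usual reading of the names, which the abstract itself does not spell out: the first word is the aggregation across groups and the second is the aggregation over voters within a group. The function names `distributed_costs` and `distortion` and the toy instance on the real line are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean


def distributed_costs(groups, candidate, dist):
    """Four costs of `candidate`, for voters partitioned into `groups`.

    Assumed reading: "X-Y" applies Y inside each group, then X across groups.
    """
    per_group_avg = [mean(dist(v, candidate) for v in g) for g in groups]
    per_group_max = [max(dist(v, candidate) for v in g) for g in groups]
    return {
        "avg-avg": mean(per_group_avg),
        "avg-max": mean(per_group_max),
        "max-avg": max(per_group_avg),
        "max-max": max(per_group_max),
    }


def distortion(groups, candidates, winner, dist, objective):
    """Ratio of the winner's cost to the best achievable cost on this instance."""
    costs = {c: distributed_costs(groups, c, dist)[objective] for c in candidates}
    best = min(costs.values())
    if best == 0:
        # Convention used here: ratio 1 if the winner is also free, else unbounded.
        return 1.0 if costs[winner] == 0 else math.inf
    return costs[winner] / best


# Toy instance: voters and candidates are points on the real line.
line_dist = lambda a, b: abs(a - b)
groups = [[0.0, 0.0, 1.0], [1.0]]
candidates = [0.0, 1.0]

costs_at_0 = distributed_costs(groups, 0.0, line_dist)
ratio = distortion(groups, candidates, 0.0, line_dist, "avg-avg")
```

In this instance the candidate at 0 wins the larger group by majority, yet the candidate at 1 has half its avg-avg cost, so electing 0 gives an instance-level ratio of 2. The distortion of a mechanism is the worst such ratio over all instances consistent with the information the mechanism sees.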
k-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy
We propose a new initialization scheme for the k-median problem in the general metric space (e.g., discrete space induced by graphs), based on the construction of metric embedding tree structure of the data. We propose a novel and efficient search algorithm which finds initial centers that can be used subsequently for the local search algorithm. The so-called HST initialization method can produce initial centers achieving lower error than those from another popular method, k-median++, and is also more efficient when k is not too small. Our HST initialization is then extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error of applying DP local search followed by our private HST initialization improves prior results on the approximation error, and approaches the lower bound within a small factor. Experiments demonstrate the effectiveness of our proposed methods.
Active Nearest-Neighbor Learning in Metric Spaces
Aryeh Kontorovich, Sivan Sabato, Ruth Urner
We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure.
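For orientation, the sketch below shows what a plain nearest-neighbor classifier looks like when the only structure available is a distance function. It is not MARMANN: there is no active label querying, margin regularization, or sample compression. The Hamming metric on strings and the names `hamming` and `nearest_neighbor_predict` are assumptions chosen only for illustration.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings (a valid metric)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(labeled, query, dist):
    """Return the label of the labeled point closest to `query` under `dist`.

    `labeled` is a list of (point, label) pairs; ties go to the earliest pair.
    """
    _, label = min(labeled, key=lambda pair: dist(pair[0], query))
    return label


labeled = [("aaaa", "A"), ("bbbb", "B")]
pred_1 = nearest_neighbor_predict(labeled, "aaab", hamming)
pred_2 = nearest_neighbor_predict(labeled, "abbb", hamming)
```

Because the classifier only ever calls `dist`, the same code works for any metric space, which is the setting the abstract targets.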
Learning Augmented Graph $k$-Clustering
Clustering is a cornerstone of unsupervised machine learning, widely applied in fields such as data organization, anomaly detection, and community detection in networks [Xu and Wunsch, 2005]. Among clustering problems, the k-means and k-median problems stand out as fundamental due to their simplicity and effectiveness. Traditional algorithms aim to partition data into k clusters, minimizing either the sum of squared distances (k-means) or the sum of absolute distances (k-median) to their respective cluster centers. The k-means algorithm has been a cornerstone of clustering research for decades, tracing its roots to foundational works by [MacQueen, 1967] and [Lloyd, 1982], who introduced the iterative optimization approach still used today. Extensions by [Hartigan and Wong, 1979] improved convergence, while [Forgy, 1965] proposed widely-used initialization techniques. The optimization principles underlying k-means were influenced by earlier algorithmic developments, such as Floyd's contributions to optimization [Floyd, 1962]. Improvements include k-means++ [Arthur and Vassilvitskii, 2007], which introduced a probabilistic seeding strategy to improve initialization quality and convergence, and Mini-Batch k-means [Sculley, 2010], which enabled clustering on massive datasets with reduced computational overhead.
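The two objectives named above differ only in the exponent applied to each point's distance to its nearest center. A minimal sketch, assuming an arbitrary user-supplied distance function (the helper name `clustering_cost` and the one-dimensional example are illustrative, not from the paper):

```python
def clustering_cost(points, centers, dist, power):
    """Sum over points of (distance to the nearest center) ** power.

    power=1 gives the k-median cost, power=2 gives the k-means cost.
    """
    return sum(min(dist(p, c) for c in centers) ** power for p in points)


line_dist = lambda a, b: abs(a - b)
points = [0, 2, 10, 13]
centers = [0, 10]

k_median_cost = clustering_cost(points, centers, line_dist, power=1)  # 0 + 2 + 0 + 3
k_means_cost = clustering_cost(points, centers, line_dist, power=2)   # 0 + 4 + 0 + 9
```

Squaring penalizes far-away points more heavily, which is why k-means solutions are more sensitive to outliers than k-median solutions on the same data.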
Improved Guarantees for Fully Dynamic k-Center Clustering with Outliers in General Metric Spaces
The metric k-center clustering problem with z outliers, also known as (k,z)-center clustering, involves clustering a given point set P in a metric space (M,d) using at most k balls, minimizing the maximum ball radius while excluding up to z points from the clustering. This problem holds fundamental significance in various domains such as machine learning, data mining, and database systems. This paper addresses the fully dynamic version of the problem, where the point set undergoes continuous updates (insertions and deletions) over time. The objective is to maintain an approximate (k,z)-center clustering with efficient update times. We propose a novel fully dynamic algorithm that maintains a $(4+\epsilon)$-approximate solution to the (k,z)-center clustering problem that covers all but at most $(1+\epsilon)z$ points at any time in the sequence with probability $1-k/e^{\Omega(\log k)}$. The algorithm achieves an expected amortized update time of $\mathcal{O}(\epsilon^{-2} k^{6} \log(k) \log(\Delta))$, and is applicable to general metric spaces.
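The (k,z)-center objective can be evaluated for a fixed set of centers by discarding the z points farthest from their nearest center and taking the largest remaining distance as the common ball radius. The sketch below only evaluates this static cost; it is not the fully dynamic algorithm described in the abstract, and the name `kz_center_radius` and the toy data are assumptions made for illustration.

```python
def kz_center_radius(points, centers, z, dist):
    """Smallest radius r such that balls of radius r around `centers`
    cover all but at most `z` of the points."""
    nearest = sorted(min(dist(p, c) for c in centers) for p in points)
    kept = nearest[: max(len(nearest) - z, 0)]  # drop the z largest distances
    return kept[-1] if kept else 0


line_dist = lambda a, b: abs(a - b)
points = [0, 1, 2, 100]

radius_no_outliers = kz_center_radius(points, [1], z=0, dist=line_dist)
radius_one_outlier = kz_center_radius(points, [1], z=1, dist=line_dist)
```

With no outliers allowed, the single far point forces a radius of 99; allowing one outlier drops it to 1, which is the effect the z parameter is meant to capture.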
Unlocking Point Processes through Point Set Diffusion
Lüdke, David, Raventós, Enric Rabasseda, Kollovieh, Marcel, Günnemann, Stephan
Point processes model the distribution of random point sets in mathematical spaces, such as spatial and temporal domains, with applications in fields like seismology, neuroscience, and economics. Existing statistical and machine learning models for point processes are predominantly constrained by their reliance on the characteristic intensity function, introducing an inherent trade-off between efficiency and flexibility. In this paper, we introduce Point Set Diffusion, a diffusion-based latent variable model that can represent arbitrary point processes on general metric spaces without relying on the intensity function. By directly learning to stochastically interpolate between noise and data point sets, our approach enables efficient, parallel sampling and flexible generation for complex conditional tasks defined on the metric space. Experiments on synthetic and real-world datasets demonstrate that Point Set Diffusion achieves state-of-the-art performance in unconditional and conditional generation of spatial and spatiotemporal point processes while providing up to orders of magnitude faster sampling than autoregressive baselines.
Universal Weak Coreset
Coresets for $k$-means and $k$-median problems yield a small summary of the data, which preserve the clustering cost with respect to any set of $k$ centers. Recently coresets have also been constructed for constrained $k$-means and $k$-median problems. However, the notion of coresets has the drawback that (i) they can only be applied in settings where the input points are allowed to have weights, and (ii) in general metric spaces, the size of the coresets can depend logarithmically on the number of points. The notion of weak coresets, which have less stringent requirements than coresets, has been studied in the context of classical $k$-means and $k$-median problems. A weak coreset is a pair $(J,S)$ of subsets of points, where $S$ acts as a summary of the point set and $J$ as a set of potential centers. This pair satisfies the properties that (i) $S$ is a good summary of the data as long as the $k$ centers are chosen from $J$ only, and (ii) there is a good choice of $k$ centers in $J$ with cost close to the optimal cost. We develop this framework, which we call universal weak coresets, for constrained clustering settings. In conjunction with recent coreset constructions for constrained settings, our designs give greater data compression, are conceptually simpler, and apply to a wide range of constrained $k$-median and $k$-means problems.
$k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy
Fan, Chenglin, Li, Ping, Li, Xiaoyun
When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the $k$-median problem in the general metric space (e.g., discrete space induced by graphs), based on the construction of metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm, for good initial centers that can be used subsequently for the local search algorithm. Our proposed HST initialization can produce initial centers achieving lower errors than those from another popular initialization method, $k$-median++, with comparable efficiency. The HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error from applying DP local search followed by our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the $k$-means problem.